This case assumes that we are a real estate company that invests in large cities by buying properties to later rent them out as tourist apartments.
The management has made the decision to invest in Madrid, and has asked us to analyze the data that the industry leader AirBnb makes public to try to find the types of properties that have greater commercial potential for tourist rental.
The main deliverable is the property classification that the valuation team should look for among the existing opportunities in the city and the main neighborhoods or geographic areas to focus on.
Find the profile(s) of properties that maximize the commercial potential in the tourist rental market and the main areas to look for them.
In this project we use the following KPIs:
On the rental price:
On occupancy:
On the purchase price:
We will use real data offered by AirBnB: http://insideairbnb.com/get-the-data.html
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
%config IPCompleter.greedy=True
From following website, we can extract the data from AirBnB:
http://insideairbnb.com/get-the-data.html
The description of each table is below.
After an overview of all tables, we work just with listings.csv and listings.csv.gz
In this phase the data quality process is carried out and afterwards all data sources are put together in one table.
Load tables
listings = pd.read_csv('listings.csv')
listings_det = pd.read_csv('listings.csv.gz',compression='gzip')
In our data we do not have the purchase price of the properties. Therefore, we are going to look for this data externally.
https://www.immopreise.at/Wien/Wohnung/Eigentum
How to extract data from a website: https://github.com/blancaRE/ds_portfolio/tree/main/Web%20scrapping
price_m2 = pd.read_csv('immopreise2.csv')
price_m2
| preisZeilenTitel | waehrung | waehrung 2 | waehrung 3 | waehrung 4 | waehrung 5 | |
|---|---|---|---|---|---|---|
| 1.0 | Innere Stadt | € 19.510 | € 17.137 | € 16.919 | € 25.016 | € 21.774 |
| 2.0 | Leopoldstadt | € 6.747 | € 7.204 | € 7.898 | € 9.405 | € 7.460 |
| 3.0 | Landstraße | € 7.982 | € 7.694 | € 7.516 | € 10.784 | € 7.801 |
| 4.0 | Wieden | € 9.875 | € 10.608 | € 11.303 | € 13.693 | € 10.769 |
| 5.0 | Margareten | € 6.362 | € 6.033 | € 5.892 | € 9.461 | € 6.164 |
| 6.0 | Mariahilf | € 8.530 | € 8.096 | € 10.233 | € 11.570 | € 9.999 |
| 7.0 | Neubau | € 7.473 | € 5.773 | € 8.714 | € 9.087 | € 8.065 |
| 8.0 | Josefstadt | € 8.645 | € 9.355 | € 8.673 | € 12.225 | € 9.029 |
| 9.0 | Alsergrund | € 8.066 | € 7.823 | € 7.706 | € 17.733 | € 10.477 |
| 10.0 | Favoriten | € 6.741 | € 5.327 | € 5.150 | € 5.604 | € 6.121 |
| 11.0 | Simmering | € 6.470 | € 5.379 | € 5.373 | k.A. | € 5.609 |
| 12.0 | Meidling | € 5.649 | € 5.727 | € 6.096 | € 5.551 | € 5.778 |
| 13.0 | Hietzing | € 6.759 | € 7.394 | € 7.296 | € 8.934 | € 7.419 |
| 14.0 | Penzing | € 5.991 | € 6.683 | € 6.633 | € 7.797 | € 6.472 |
| 15.0 | Rudolfsheim-Fünfhaus | € 5.902 | € 6.745 | € 6.220 | € 8.132 | € 6.509 |
| 16.0 | Ottakring | € 5.590 | € 5.961 | € 5.418 | € 5.778 | € 5.587 |
| 17.0 | Hernals | € 5.257 | € 6.136 | € 7.077 | € 6.786 | € 6.314 |
| 18.0 | Währing | € 7.866 | € 8.157 | € 8.517 | € 10.404 | € 8.680 |
| 19.0 | Döbling | € 7.934 | € 8.521 | € 10.417 | € 10.954 | € 9.695 |
| 20.0 | Brigittenau | € 5.484 | € 5.679 | € 6.809 | k.A. | € 5.863 |
| 21.0 | Floridsdorf | € 7.025 | € 6.160 | € 6.639 | € 4.647 | € 6.526 |
| 22.0 | Donaustadt | € 7.560 | € 6.132 | € 7.350 | € 7.429 | € 7.084 |
| 23.0 | Liesing | € 6.288 | € 6.204 | € 5.679 | € 5.764 | € 6.070 |
We load the data and rename the columns.
price_m2 = pd.read_csv('immopreise2.csv') \
.rename(columns = {price_m2.columns[0]:'district',price_m2.columns[1]:'price_0_50',price_m2.columns[2]:'price_51_80' , \
price_m2.columns[3]:'price_81_129',price_m2.columns[4]:'price_130',price_m2.columns[5]:'price_avg_m2'})
price_m2
| district | price_0_50 | price_51_80 | price_81_129 | price_130 | price_avg_m2 | |
|---|---|---|---|---|---|---|
| 1.0 | Innere Stadt | € 19.510 | € 17.137 | € 16.919 | € 25.016 | € 21.774 |
| 2.0 | Leopoldstadt | € 6.747 | € 7.204 | € 7.898 | € 9.405 | € 7.460 |
| 3.0 | Landstraße | € 7.982 | € 7.694 | € 7.516 | € 10.784 | € 7.801 |
| 4.0 | Wieden | € 9.875 | € 10.608 | € 11.303 | € 13.693 | € 10.769 |
| 5.0 | Margareten | € 6.362 | € 6.033 | € 5.892 | € 9.461 | € 6.164 |
| 6.0 | Mariahilf | € 8.530 | € 8.096 | € 10.233 | € 11.570 | € 9.999 |
| 7.0 | Neubau | € 7.473 | € 5.773 | € 8.714 | € 9.087 | € 8.065 |
| 8.0 | Josefstadt | € 8.645 | € 9.355 | € 8.673 | € 12.225 | € 9.029 |
| 9.0 | Alsergrund | € 8.066 | € 7.823 | € 7.706 | € 17.733 | € 10.477 |
| 10.0 | Favoriten | € 6.741 | € 5.327 | € 5.150 | € 5.604 | € 6.121 |
| 11.0 | Simmering | € 6.470 | € 5.379 | € 5.373 | k.A. | € 5.609 |
| 12.0 | Meidling | € 5.649 | € 5.727 | € 6.096 | € 5.551 | € 5.778 |
| 13.0 | Hietzing | € 6.759 | € 7.394 | € 7.296 | € 8.934 | € 7.419 |
| 14.0 | Penzing | € 5.991 | € 6.683 | € 6.633 | € 7.797 | € 6.472 |
| 15.0 | Rudolfsheim-Fünfhaus | € 5.902 | € 6.745 | € 6.220 | € 8.132 | € 6.509 |
| 16.0 | Ottakring | € 5.590 | € 5.961 | € 5.418 | € 5.778 | € 5.587 |
| 17.0 | Hernals | € 5.257 | € 6.136 | € 7.077 | € 6.786 | € 6.314 |
| 18.0 | Währing | € 7.866 | € 8.157 | € 8.517 | € 10.404 | € 8.680 |
| 19.0 | Döbling | € 7.934 | € 8.521 | € 10.417 | € 10.954 | € 9.695 |
| 20.0 | Brigittenau | € 5.484 | € 5.679 | € 6.809 | k.A. | € 5.863 |
| 21.0 | Floridsdorf | € 7.025 | € 6.160 | € 6.639 | € 4.647 | € 6.526 |
| 22.0 | Donaustadt | € 7.560 | € 6.132 | € 7.350 | € 7.429 | € 7.084 |
| 23.0 | Liesing | € 6.288 | € 6.204 | € 5.679 | € 5.764 | € 6.070 |
price_m2.info()
<class 'pandas.core.frame.DataFrame'> Float64Index: 23 entries, 1.0 to 23.0 Data columns (total 6 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 district 23 non-null object 1 price_0_50 23 non-null object 2 price_51_80 23 non-null object 3 price_81_129 23 non-null object 4 price_130 23 non-null object 5 price_avg_m2 23 non-null object dtypes: object(6) memory usage: 1.3+ KB
Replace the unknown values with the average price.
price_m2.price_130 = np.where(price_m2.price_130.str.contains('k.A.'), price_m2.price_avg_m2, price_m2.price_130 ,)
price_m2
| district | price_0_50 | price_51_80 | price_81_129 | price_130 | price_avg_m2 | |
|---|---|---|---|---|---|---|
| 1.0 | Innere Stadt | € 19.510 | € 17.137 | € 16.919 | € 25.016 | € 21.774 |
| 2.0 | Leopoldstadt | € 6.747 | € 7.204 | € 7.898 | € 9.405 | € 7.460 |
| 3.0 | Landstraße | € 7.982 | € 7.694 | € 7.516 | € 10.784 | € 7.801 |
| 4.0 | Wieden | € 9.875 | € 10.608 | € 11.303 | € 13.693 | € 10.769 |
| 5.0 | Margareten | € 6.362 | € 6.033 | € 5.892 | € 9.461 | € 6.164 |
| 6.0 | Mariahilf | € 8.530 | € 8.096 | € 10.233 | € 11.570 | € 9.999 |
| 7.0 | Neubau | € 7.473 | € 5.773 | € 8.714 | € 9.087 | € 8.065 |
| 8.0 | Josefstadt | € 8.645 | € 9.355 | € 8.673 | € 12.225 | € 9.029 |
| 9.0 | Alsergrund | € 8.066 | € 7.823 | € 7.706 | € 17.733 | € 10.477 |
| 10.0 | Favoriten | € 6.741 | € 5.327 | € 5.150 | € 5.604 | € 6.121 |
| 11.0 | Simmering | € 6.470 | € 5.379 | € 5.373 | € 5.609 | € 5.609 |
| 12.0 | Meidling | € 5.649 | € 5.727 | € 6.096 | € 5.551 | € 5.778 |
| 13.0 | Hietzing | € 6.759 | € 7.394 | € 7.296 | € 8.934 | € 7.419 |
| 14.0 | Penzing | € 5.991 | € 6.683 | € 6.633 | € 7.797 | € 6.472 |
| 15.0 | Rudolfsheim-Fünfhaus | € 5.902 | € 6.745 | € 6.220 | € 8.132 | € 6.509 |
| 16.0 | Ottakring | € 5.590 | € 5.961 | € 5.418 | € 5.778 | € 5.587 |
| 17.0 | Hernals | € 5.257 | € 6.136 | € 7.077 | € 6.786 | € 6.314 |
| 18.0 | Währing | € 7.866 | € 8.157 | € 8.517 | € 10.404 | € 8.680 |
| 19.0 | Döbling | € 7.934 | € 8.521 | € 10.417 | € 10.954 | € 9.695 |
| 20.0 | Brigittenau | € 5.484 | € 5.679 | € 6.809 | € 5.863 | € 5.863 |
| 21.0 | Floridsdorf | € 7.025 | € 6.160 | € 6.639 | € 4.647 | € 6.526 |
| 22.0 | Donaustadt | € 7.560 | € 6.132 | € 7.350 | € 7.429 | € 7.084 |
| 23.0 | Liesing | € 6.288 | € 6.204 | € 5.679 | € 5.764 | € 6.070 |
The prices variables are an objtect. it should be a numeric variable.
price_m2.iloc[:,5]
1.0 € 21.774 2.0 € 7.460 3.0 € 7.801 4.0 € 10.769 5.0 € 6.164 6.0 € 9.999 7.0 € 8.065 8.0 € 9.029 9.0 € 10.477 10.0 € 6.121 11.0 € 5.609 12.0 € 5.778 13.0 € 7.419 14.0 € 6.472 15.0 € 6.509 16.0 € 5.587 17.0 € 6.314 18.0 € 8.680 19.0 € 9.695 20.0 € 5.863 21.0 € 6.526 22.0 € 7.084 23.0 € 6.070 Name: price_avg_m2, dtype: object
price_m2.iloc[:,1] =price_m2.iloc[:,1].str.split(expand = True)[1].str.replace('.','',regex=False).astype('int')
price_m2.iloc[:,2] =price_m2.iloc[:,2].str.split(expand = True)[1].str.replace('.','',regex=False).astype('int')
price_m2.iloc[:,3] =price_m2.iloc[:,3].str.split(expand = True)[1].str.replace('.','',regex=False).astype('int')
price_m2.iloc[:,4] =price_m2.iloc[:,4].str.split(expand = True)[1].str.replace('.','',regex=False).astype('int')
price_m2.iloc[:,5] =price_m2.iloc[:,5].str.split(expand = True)[1].str.replace('.','',regex=False).astype('int')
price_m2
| district | price_0_50 | price_51_80 | price_81_129 | price_130 | price_avg_m2 | |
|---|---|---|---|---|---|---|
| 1.0 | Innere Stadt | 19510 | 17137 | 16919 | 25016 | 21774 |
| 2.0 | Leopoldstadt | 6747 | 7204 | 7898 | 9405 | 7460 |
| 3.0 | Landstraße | 7982 | 7694 | 7516 | 10784 | 7801 |
| 4.0 | Wieden | 9875 | 10608 | 11303 | 13693 | 10769 |
| 5.0 | Margareten | 6362 | 6033 | 5892 | 9461 | 6164 |
| 6.0 | Mariahilf | 8530 | 8096 | 10233 | 11570 | 9999 |
| 7.0 | Neubau | 7473 | 5773 | 8714 | 9087 | 8065 |
| 8.0 | Josefstadt | 8645 | 9355 | 8673 | 12225 | 9029 |
| 9.0 | Alsergrund | 8066 | 7823 | 7706 | 17733 | 10477 |
| 10.0 | Favoriten | 6741 | 5327 | 5150 | 5604 | 6121 |
| 11.0 | Simmering | 6470 | 5379 | 5373 | 5609 | 5609 |
| 12.0 | Meidling | 5649 | 5727 | 6096 | 5551 | 5778 |
| 13.0 | Hietzing | 6759 | 7394 | 7296 | 8934 | 7419 |
| 14.0 | Penzing | 5991 | 6683 | 6633 | 7797 | 6472 |
| 15.0 | Rudolfsheim-Fünfhaus | 5902 | 6745 | 6220 | 8132 | 6509 |
| 16.0 | Ottakring | 5590 | 5961 | 5418 | 5778 | 5587 |
| 17.0 | Hernals | 5257 | 6136 | 7077 | 6786 | 6314 |
| 18.0 | Währing | 7866 | 8157 | 8517 | 10404 | 8680 |
| 19.0 | Döbling | 7934 | 8521 | 10417 | 10954 | 9695 |
| 20.0 | Brigittenau | 5484 | 5679 | 6809 | 5863 | 5863 |
| 21.0 | Floridsdorf | 7025 | 6160 | 6639 | 4647 | 6526 |
| 22.0 | Donaustadt | 7560 | 6132 | 7350 | 7429 | 7084 |
| 23.0 | Liesing | 6288 | 6204 | 5679 | 5764 | 6070 |
listings.head()
| id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 | number_of_reviews_ltm | license | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15883 | b&b near Old Danube river | 62142 | Eva | NaN | Donaustadt | 48.24262 | 16.42767 | Hotel room | 120 | 1 | 14 | 2021-10-07 | 0.17 | 3 | 364 | 3 | NaN |
| 1 | 38768 | central cityapartement- wifi- nice neighbourhood | 166283 | Hannes | NaN | Leopoldstadt | 48.21924 | 16.37831 | Entire home/apt | 66 | 3 | 336 | 2021-11-14 | 2.57 | 3 | 124 | 13 | NaN |
| 2 | 40625 | Near Palace Schönbrunn, Apt. 1 | 175131 | Ingela | NaN | Rudolfsheim-Fnfhaus | 48.18434 | 16.32701 | Entire home/apt | 156 | 1 | 162 | 2021-11-01 | 1.17 | 15 | 306 | 7 | NaN |
| 3 | 51287 | little studio- next to citycenter- wifi- nice ... | 166283 | Hannes | NaN | Leopoldstadt | 48.21778 | 16.37847 | Entire home/apt | 62 | 3 | 327 | 2021-11-22 | 2.47 | 3 | 136 | 18 | NaN |
| 4 | 70637 | Flat in the Center with Terrace | 358842 | Elxe | NaN | Leopoldstadt | 48.21760 | 16.38018 | Private room | 50 | 2 | 117 | 2021-06-25 | 0.90 | 3 | 333 | 1 | NaN |
listings.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 11429 entries, 0 to 11428 Data columns (total 18 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 11429 non-null int64 1 name 11416 non-null object 2 host_id 11429 non-null int64 3 host_name 11411 non-null object 4 neighbourhood_group 0 non-null float64 5 neighbourhood 11429 non-null object 6 latitude 11429 non-null float64 7 longitude 11429 non-null float64 8 room_type 11429 non-null object 9 price 11429 non-null int64 10 minimum_nights 11429 non-null int64 11 number_of_reviews 11429 non-null int64 12 last_review 9091 non-null object 13 reviews_per_month 9091 non-null float64 14 calculated_host_listings_count 11429 non-null int64 15 availability_365 11429 non-null int64 16 number_of_reviews_ltm 11429 non-null int64 17 license 0 non-null float64 dtypes: float64(5), int64(8), object(5) memory usage: 1.6+ MB
We will delete those variables that we do not directly need for our goal.
listings
| id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 | number_of_reviews_ltm | license | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15883 | b&b near Old Danube river | 62142 | Eva | NaN | Donaustadt | 48.242620 | 16.427670 | Hotel room | 120 | 1 | 14 | 2021-10-07 | 0.17 | 3 | 364 | 3 | NaN |
| 1 | 38768 | central cityapartement- wifi- nice neighbourhood | 166283 | Hannes | NaN | Leopoldstadt | 48.219240 | 16.378310 | Entire home/apt | 66 | 3 | 336 | 2021-11-14 | 2.57 | 3 | 124 | 13 | NaN |
| 2 | 40625 | Near Palace Schönbrunn, Apt. 1 | 175131 | Ingela | NaN | Rudolfsheim-Fnfhaus | 48.184340 | 16.327010 | Entire home/apt | 156 | 1 | 162 | 2021-11-01 | 1.17 | 15 | 306 | 7 | NaN |
| 3 | 51287 | little studio- next to citycenter- wifi- nice ... | 166283 | Hannes | NaN | Leopoldstadt | 48.217780 | 16.378470 | Entire home/apt | 62 | 3 | 327 | 2021-11-22 | 2.47 | 3 | 136 | 18 | NaN |
| 4 | 70637 | Flat in the Center with Terrace | 358842 | Elxe | NaN | Leopoldstadt | 48.217600 | 16.380180 | Private room | 50 | 2 | 117 | 2021-06-25 | 0.90 | 3 | 333 | 1 | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 11424 | 53711307 | Designer Apartment | 8 mins to U3 Ottakring | 64601861 | Sarah & Thomas | NaN | Ottakring | 48.208908 | 16.321311 | Entire home/apt | 69 | 4 | 0 | NaN | NaN | 5 | 139 | 0 | NaN |
| 11425 | 53713303 | XII-JNBP New Apartment top 12 | 120675973 | Ana | NaN | Ottakring | 48.213201 | 16.324996 | Entire home/apt | 69 | 2 | 0 | NaN | NaN | 6 | 171 | 0 | NaN |
| 11426 | 53715433 | Private 1BR with shared bathroom | 404652017 | Ali | NaN | Ottakring | 48.206001 | 16.328201 | Private room | 53 | 1 | 0 | NaN | NaN | 83 | 13 | 0 | NaN |
| 11427 | 53716202 | adorable 1BR with shared bathroom | 404652017 | Ali | NaN | Ottakring | 48.206153 | 16.328158 | Private room | 54 | 1 | 0 | NaN | NaN | 83 | 13 | 0 | NaN |
| 11428 | 53716591 | Schöne Suite near Wiener Stadthalle | 404652017 | Ali | NaN | Ottakring | 48.205422 | 16.327949 | Private room | 55 | 1 | 0 | NaN | NaN | 83 | 13 | 0 | NaN |
11429 rows × 18 columns
to_delete = ['host_name',
'neighbourhood_group',
'number_of_reviews',
'last_review',
'reviews_per_month',
'calculated_host_listings_count',
'number_of_reviews_ltm',
'license'
]
listings.drop(columns = to_delete, inplace=True)
listings
| id | name | host_id | neighbourhood | latitude | longitude | room_type | price | minimum_nights | availability_365 | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15883 | b&b near Old Danube river | 62142 | Donaustadt | 48.242620 | 16.427670 | Hotel room | 120 | 1 | 364 |
| 1 | 38768 | central cityapartement- wifi- nice neighbourhood | 166283 | Leopoldstadt | 48.219240 | 16.378310 | Entire home/apt | 66 | 3 | 124 |
| 2 | 40625 | Near Palace Schönbrunn, Apt. 1 | 175131 | Rudolfsheim-Fnfhaus | 48.184340 | 16.327010 | Entire home/apt | 156 | 1 | 306 |
| 3 | 51287 | little studio- next to citycenter- wifi- nice ... | 166283 | Leopoldstadt | 48.217780 | 16.378470 | Entire home/apt | 62 | 3 | 136 |
| 4 | 70637 | Flat in the Center with Terrace | 358842 | Leopoldstadt | 48.217600 | 16.380180 | Private room | 50 | 2 | 333 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 11424 | 53711307 | Designer Apartment | 8 mins to U3 Ottakring | 64601861 | Ottakring | 48.208908 | 16.321311 | Entire home/apt | 69 | 4 | 139 |
| 11425 | 53713303 | XII-JNBP New Apartment top 12 | 120675973 | Ottakring | 48.213201 | 16.324996 | Entire home/apt | 69 | 2 | 171 |
| 11426 | 53715433 | Private 1BR with shared bathroom | 404652017 | Ottakring | 48.206001 | 16.328201 | Private room | 53 | 1 | 13 |
| 11427 | 53716202 | adorable 1BR with shared bathroom | 404652017 | Ottakring | 48.206153 | 16.328158 | Private room | 54 | 1 | 13 |
| 11428 | 53716591 | Schöne Suite near Wiener Stadthalle | 404652017 | Ottakring | 48.205422 | 16.327949 | Private room | 55 | 1 | 13 |
11429 rows × 10 columns
Review the variable types
listings.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 11429 entries, 0 to 11428 Data columns (total 10 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 11429 non-null int64 1 name 11416 non-null object 2 host_id 11429 non-null int64 3 neighbourhood 11429 non-null object 4 latitude 11429 non-null float64 5 longitude 11429 non-null float64 6 room_type 11429 non-null object 7 price 11429 non-null int64 8 minimum_nights 11429 non-null int64 9 availability_365 11429 non-null int64 dtypes: float64(2), int64(5), object(3) memory usage: 893.0+ KB
Covert object to category (objects are not efficient, it takes up a lot of memory space)
for variable in ['neighbourhood','room_type']:
listings[variable] = listings[variable].astype('category')
Check
listings.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 11429 entries, 0 to 11428 Data columns (total 10 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 11429 non-null int64 1 name 11416 non-null object 2 host_id 11429 non-null int64 3 neighbourhood 11429 non-null category 4 latitude 11429 non-null float64 5 longitude 11429 non-null float64 6 room_type 11429 non-null category 7 price 11429 non-null int64 8 minimum_nights 11429 non-null int64 9 availability_365 11429 non-null int64 dtypes: category(2), float64(2), int64(5), object(1) memory usage: 737.7+ KB
listings[listings.name.isna()]
| id | name | host_id | neighbourhood | latitude | longitude | room_type | price | minimum_nights | availability_365 | |
|---|---|---|---|---|---|---|---|---|---|---|
| 507 | 2394798 | NaN | 12233591 | Alsergrund | 48.22292 | 16.36113 | Entire home/apt | 1200 | 1 | 0 |
| 1131 | 6178943 | NaN | 32051578 | Margareten | 48.18741 | 16.36095 | Entire home/apt | 99 | 1 | 0 |
| 1160 | 6311354 | NaN | 32832294 | Whring | 48.22249 | 16.33692 | Entire home/apt | 150 | 1 | 0 |
| 1187 | 6367745 | NaN | 33182091 | Brigittenau | 48.22675 | 16.36616 | Private room | 35 | 2 | 0 |
| 1362 | 7169015 | NaN | 32315769 | Rudolfsheim-Fnfhaus | 48.20262 | 16.33052 | Private room | 28 | 1 | 0 |
| 1388 | 7316171 | NaN | 17394384 | Mariahilf | 48.19685 | 16.35082 | Entire home/apt | 70 | 1 | 0 |
| 1520 | 8114746 | NaN | 2492437 | Dbling | 48.25958 | 16.35840 | Entire home/apt | 85 | 4 | 0 |
| 1525 | 8142136 | NaN | 39991543 | Ottakring | 48.22227 | 16.31305 | Private room | 31 | 1 | 0 |
| 1673 | 9339553 | NaN | 11594706 | Leopoldstadt | 48.22600 | 16.38239 | Private room | 60 | 3 | 0 |
| 1842 | 10491736 | NaN | 34798203 | Whring | 48.22157 | 16.34073 | Private room | 58 | 1 | 0 |
| 2144 | 12868904 | NaN | 70346740 | Landstra§e | 48.20207 | 16.39775 | Entire home/apt | 45 | 3 | 0 |
| 2146 | 12881583 | NaN | 36155856 | Whring | 48.22871 | 16.34136 | Private room | 25 | 2 | 0 |
| 2195 | 13109834 | NaN | 40671924 | Mariahilf | 48.19094 | 16.34285 | Entire home/apt | 49 | 2 | 0 |
The name of the property is not a crucial variable. We leave it as it is.
Check for duplicate records
listings.duplicated().sum()
0
listings.neighbourhood.value_counts()
Leopoldstadt 1229 Landstra§e 1022 Rudolfsheim-Fnfhaus 862 Favoriten 691 Alsergrund 682 Neubau 678 Innere Stadt 622 Margareten 615 Ottakring 581 Mariahilf 531 Wieden 469 Meidling 457 Brigittenau 429 Josefstadt 428 Donaustadt 380 Whring 338 Hernals 337 Penzing 295 Dbling 246 Floridsdorf 186 Hietzing 141 Simmering 137 Liesing 73 Name: neighbourhood, dtype: int64
There are characters not porperly recognized. We will solve it later.
listings.room_type.value_counts()
Entire home/apt 8591 Private room 2694 Shared room 80 Hotel room 64 Name: room_type, dtype: int64
We realize that there are hotels. Our company does not plan to buy hotels, so we have to delete these records.
listings = listings.loc[listings.room_type != 'Hotel room']
listings.room_type.value_counts()
Entire home/apt 8591 Private room 2694 Shared room 80 Hotel room 0 Name: room_type, dtype: int64
listings.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 11365 entries, 1 to 11428 Data columns (total 10 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 11365 non-null int64 1 name 11352 non-null object 2 host_id 11365 non-null int64 3 neighbourhood 11365 non-null category 4 latitude 11365 non-null float64 5 longitude 11365 non-null float64 6 room_type 11365 non-null category 7 price 11365 non-null int64 8 minimum_nights 11365 non-null int64 9 availability_365 11365 non-null int64 dtypes: category(2), float64(2), int64(5), object(1) memory usage: 822.2+ KB
variables_2_analize= ['price', 'minimum_nights', 'availability_365']
listings.loc[:,variables_2_analize].describe().T
#listings_det.describe(include = 'number').T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| price | 11365.0 | 82.000176 | 160.790406 | 9.0 | 40.0 | 60.0 | 90.0 | 9270.0 |
| minimum_nights | 11365.0 | 5.969996 | 24.342792 | 1.0 | 1.0 | 2.0 | 3.0 | 1125.0 |
| availability_365 | 11365.0 | 142.788297 | 143.494410 | 0.0 | 0.0 | 89.0 | 309.0 | 365.0 |
Conlcusions:
We review price lows and highs
listings.price.plot.kde();
Check highs
plt.figure(figsize=(16,8))
listings.price.loc[listings.price > 1000].value_counts().sort_index().plot.bar()
plt.xticks(size = 10);
Check data close to zero.
plt.figure(figsize=(16,8))
listings.price.loc[listings.price < 30].value_counts().sort_index().plot.bar()
plt.xticks(size = 10);
Conclusion:
listings = listings.loc[listings.price > 19]
listings
| id | name | host_id | neighbourhood | latitude | longitude | room_type | price | minimum_nights | availability_365 | |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 38768 | central cityapartement- wifi- nice neighbourhood | 166283 | Leopoldstadt | 48.219240 | 16.378310 | Entire home/apt | 66 | 3 | 124 |
| 2 | 40625 | Near Palace Schönbrunn, Apt. 1 | 175131 | Rudolfsheim-Fnfhaus | 48.184340 | 16.327010 | Entire home/apt | 156 | 1 | 306 |
| 3 | 51287 | little studio- next to citycenter- wifi- nice ... | 166283 | Leopoldstadt | 48.217780 | 16.378470 | Entire home/apt | 62 | 3 | 136 |
| 4 | 70637 | Flat in the Center with Terrace | 358842 | Leopoldstadt | 48.217600 | 16.380180 | Private room | 50 | 2 | 333 |
| 5 | 75471 | nice big apartment with balcony | 363315 | Ottakring | 48.222070 | 16.315940 | Entire home/apt | 77 | 3 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 11424 | 53711307 | Designer Apartment | 8 mins to U3 Ottakring | 64601861 | Ottakring | 48.208908 | 16.321311 | Entire home/apt | 69 | 4 | 139 |
| 11425 | 53713303 | XII-JNBP New Apartment top 12 | 120675973 | Ottakring | 48.213201 | 16.324996 | Entire home/apt | 69 | 2 | 171 |
| 11426 | 53715433 | Private 1BR with shared bathroom | 404652017 | Ottakring | 48.206001 | 16.328201 | Private room | 53 | 1 | 13 |
| 11427 | 53716202 | adorable 1BR with shared bathroom | 404652017 | Ottakring | 48.206153 | 16.328158 | Private room | 54 | 1 | 13 |
| 11428 | 53716591 | Schöne Suite near Wiener Stadthalle | 404652017 | Ottakring | 48.205422 | 16.327949 | Private room | 55 | 1 | 13 |
10959 rows × 10 columns
listings_det.head()
| id | listing_url | scrape_id | last_scraped | name | description | neighborhood_overview | picture_url | host_id | host_url | host_name | host_since | host_location | host_about | host_response_time | host_response_rate | host_acceptance_rate | host_is_superhost | host_thumbnail_url | host_picture_url | host_neighbourhood | host_listings_count | host_total_listings_count | host_verifications | host_has_profile_pic | host_identity_verified | neighbourhood | neighbourhood_cleansed | neighbourhood_group_cleansed | latitude | longitude | property_type | room_type | accommodates | bathrooms | bathrooms_text | bedrooms | beds | amenities | price | minimum_nights | maximum_nights | minimum_minimum_nights | maximum_minimum_nights | minimum_maximum_nights | maximum_maximum_nights | minimum_nights_avg_ntm | maximum_nights_avg_ntm | calendar_updated | has_availability | availability_30 | availability_60 | availability_90 | availability_365 | calendar_last_scraped | number_of_reviews | number_of_reviews_ltm | number_of_reviews_l30d | first_review | last_review | review_scores_rating | review_scores_accuracy | review_scores_cleanliness | review_scores_checkin | review_scores_communication | review_scores_location | review_scores_value | license | instant_bookable | calculated_host_listings_count | calculated_host_listings_count_entire_homes | calculated_host_listings_count_private_rooms | calculated_host_listings_count_shared_rooms | reviews_per_month | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15883 | https://www.airbnb.com/rooms/15883 | 20211208225152 | 2021-12-09 | b&b near Old Danube river | Four rooms, each one differently and individua... | small and personal<br /><br />Four rooms at th... | https://a0.muscache.com/pictures/18eff738-a737... | 62142 | https://www.airbnb.com/users/show/62142 | Eva | 2009-12-11 | Vienna, Wien, Austria | Mein größtes Hobby: Reisen! Am liebsten mit me... | within an hour | 100% | 100% | f | https://a0.muscache.com/im/pictures/user/24166... | https://a0.muscache.com/im/pictures/user/24166... | Donaustadt | 6.0 | 6.0 | ['email', 'phone', 'facebook', 'reviews', 'jum... | t | t | Vienna, Austria | Donaustadt | NaN | 48.24262 | 16.42767 | Room in bed and breakfast | Hotel room | 3 | NaN | 1 private bath | 1.0 | 2.0 | ["Pack \u2019n play/Travel crib", "Bed linens"... | $120.00 | 1 | 365 | 1 | 1 | 365 | 365 | 1.0 | 365.0 | NaN | t | 29 | 59 | 89 | 364 | 2021-12-09 | 14 | 3 | 0 | 2015-04-10 | 2021-10-07 | 4.71 | 4.86 | 4.93 | 4.93 | 4.86 | 4.71 | 4.50 | NaN | f | 3 | 1 | 0 | 0 | 0.17 |
| 1 | 38768 | https://www.airbnb.com/rooms/38768 | 20211208225152 | 2021-12-09 | central cityapartement- wifi- nice neighbourhood | 39m² apartment with beautiful courtyard of the... | the Karmeliterviertel became very popular in t... | https://a0.muscache.com/pictures/ad4089a3-5355... | 166283 | https://www.airbnb.com/users/show/166283 | Hannes | 2010-07-14 | Wien, Wien, Austria | I am open minded and like travelling myself. I... | NaN | NaN | NaN | t | https://a0.muscache.com/im/users/166283/profil... | https://a0.muscache.com/im/users/166283/profil... | Leopoldstadt | 1.0 | 1.0 | ['email', 'phone', 'reviews', 'jumio', 'offlin... | t | t | Vienna, Austria | Leopoldstadt | NaN | 48.21924 | 16.37831 | Entire rental unit | Entire home/apt | 5 | NaN | 1 bath | 1.0 | 3.0 | ["Washer", "Dishes and silverware", "Heating",... | $66.00 | 3 | 100 | 3 | 28 | 1125 | 1125 | 7.0 | 1125.0 | NaN | t | 16 | 23 | 34 | 124 | 2021-12-09 | 336 | 13 | 2 | 2011-03-23 | 2021-11-14 | 4.75 | 4.80 | 4.65 | 4.91 | 4.93 | 4.74 | 4.70 | NaN | t | 3 | 3 | 0 | 0 | 2.57 |
| 2 | 40625 | https://www.airbnb.com/rooms/40625 | 20211208225152 | 2021-12-09 | Near Palace Schönbrunn, Apt. 1 | Welcome to my Apt. 1!<br /><br />This is a 2be... | The neighbourhood offers plenty of restaurants... | https://a0.muscache.com/pictures/11509144/d55c... | 175131 | https://www.airbnb.com/users/show/175131 | Ingela | 2010-07-20 | Vienna, Wien, Austria | I´m originally from Sweden but have been livin... | within a day | 97% | 79% | t | https://a0.muscache.com/im/users/175131/profil... | https://a0.muscache.com/im/users/175131/profil... | Rudolfsheim-Fünfhaus | 16.0 | 16.0 | ['email', 'phone', 'reviews', 'jumio', 'govern... | t | t | Vienna, Austria | Rudolfsheim-Fnfhaus | NaN | 48.18434 | 16.32701 | Entire rental unit | Entire home/apt | 6 | NaN | 1 bath | 2.0 | 4.0 | ["Dishes and silverware", "Cable TV", "Dedicat... | $156.00 | 1 | 180 | 1 | 3 | 180 | 180 | 1.0 | 180.0 | NaN | t | 0 | 6 | 36 | 306 | 2021-12-09 | 162 | 7 | 0 | 2010-08-04 | 2021-11-01 | 4.84 | 4.91 | 4.87 | 4.90 | 4.93 | 4.59 | 4.73 | NaN | f | 15 | 14 | 1 | 0 | 1.17 |
| 3 | 51287 | https://www.airbnb.com/rooms/51287 | 20211208225152 | 2021-12-09 | little studio- next to citycenter- wifi- nice ... | small studio in new renovated old house and ve... | The neighbourhood has a lot of very nice littl... | https://a0.muscache.com/pictures/25163038/1c4e... | 166283 | https://www.airbnb.com/users/show/166283 | Hannes | 2010-07-14 | Wien, Wien, Austria | I am open minded and like travelling myself. I... | NaN | NaN | NaN | t | https://a0.muscache.com/im/users/166283/profil... | https://a0.muscache.com/im/users/166283/profil... | Leopoldstadt | 1.0 | 1.0 | ['email', 'phone', 'reviews', 'jumio', 'offlin... | t | t | Vienna, Austria | Leopoldstadt | NaN | 48.21778 | 16.37847 | Entire rental unit | Entire home/apt | 3 | NaN | 1 bath | NaN | 2.0 | ["Washer", "Dishes and silverware", "Heating",... | $62.00 | 3 | 100 | 3 | 28 | 1125 | 1125 | 9.1 | 1125.0 | NaN | t | 15 | 45 | 75 | 136 | 2021-12-09 | 327 | 18 | 3 | 2011-01-27 | 2021-11-22 | 4.64 | 4.76 | 4.52 | 4.92 | 4.95 | 4.86 | 4.56 | NaN | t | 3 | 3 | 0 | 0 | 2.47 |
| 4 | 70637 | https://www.airbnb.com/rooms/70637 | 20211208225152 | 2021-12-09 | Flat in the Center with Terrace | <b>The space</b><br />My apartment (including ... | NaN | https://a0.muscache.com/pictures/925691/c8c1bd... | 358842 | https://www.airbnb.com/users/show/358842 | Elxe | 2011-01-23 | Vienna, Vienna, Austria | Flat in the Center with TerraceWien, Wien, Öst... | within an hour | 100% | 76% | t | https://a0.muscache.com/im/users/358842/profil... | https://a0.muscache.com/im/users/358842/profil... | Leopoldstadt | 3.0 | 3.0 | ['email', 'phone', 'facebook', 'reviews', 'off... | t | t | NaN | Leopoldstadt | NaN | 48.21760 | 16.38018 | Private room in rental unit | Private room | 2 | NaN | 2 shared baths | 1.0 | 2.0 | ["Washer", "Dishes and silverware", "BBQ grill... | $50.00 | 2 | 1000 | 2 | 2 | 1000 | 1000 | 2.0 | 1000.0 | NaN | t | 0 | 28 | 58 | 333 | 2021-12-09 | 117 | 1 | 0 | 2011-03-28 | 2021-06-25 | 4.77 | 4.74 | 4.68 | 4.80 | 4.75 | 4.81 | 4.71 | NaN | f | 3 | 1 | 2 | 0 | 0.90 |
listings_det.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 11429 entries, 0 to 11428 Data columns (total 74 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 11429 non-null int64 1 listing_url 11429 non-null object 2 scrape_id 11429 non-null int64 3 last_scraped 11429 non-null object 4 name 11416 non-null object 5 description 11097 non-null object 6 neighborhood_overview 6229 non-null object 7 picture_url 11429 non-null object 8 host_id 11429 non-null int64 9 host_url 11429 non-null object 10 host_name 11411 non-null object 11 host_since 11411 non-null object 12 host_location 11393 non-null object 13 host_about 6086 non-null object 14 host_response_time 6516 non-null object 15 host_response_rate 6516 non-null object 16 host_acceptance_rate 6861 non-null object 17 host_is_superhost 11411 non-null object 18 host_thumbnail_url 11411 non-null object 19 host_picture_url 11411 non-null object 20 host_neighbourhood 8063 non-null object 21 host_listings_count 11411 non-null float64 22 host_total_listings_count 11411 non-null float64 23 host_verifications 11429 non-null object 24 host_has_profile_pic 11411 non-null object 25 host_identity_verified 11411 non-null object 26 neighbourhood 6229 non-null object 27 neighbourhood_cleansed 11429 non-null object 28 neighbourhood_group_cleansed 0 non-null float64 29 latitude 11429 non-null float64 30 longitude 11429 non-null float64 31 property_type 11429 non-null object 32 room_type 11429 non-null object 33 accommodates 11429 non-null int64 34 bathrooms 0 non-null float64 35 bathrooms_text 11416 non-null object 36 bedrooms 10084 non-null float64 37 beds 10976 non-null float64 38 amenities 11429 non-null object 39 price 11429 non-null object 40 minimum_nights 11429 non-null int64 41 maximum_nights 11429 non-null int64 42 minimum_minimum_nights 11429 non-null int64 43 maximum_minimum_nights 11429 non-null int64 44 minimum_maximum_nights 11429 non-null int64 45 maximum_maximum_nights 11429 non-null int64 46 minimum_nights_avg_ntm 11429 non-null float64 47 maximum_nights_avg_ntm 11429 non-null float64 48 calendar_updated 0 non-null float64 49 has_availability 11429 non-null object 50 availability_30 11429 non-null int64 51 availability_60 11429 non-null int64 52 availability_90 11429 non-null int64 53 availability_365 11429 non-null int64 54 calendar_last_scraped 11429 non-null object 55 number_of_reviews 11429 non-null int64 56 number_of_reviews_ltm 11429 non-null int64 57 number_of_reviews_l30d 11429 non-null int64 58 first_review 9091 non-null object 59 last_review 9091 non-null object 60 review_scores_rating 9091 non-null float64 61 review_scores_accuracy 8983 non-null float64 62 review_scores_cleanliness 8983 non-null float64 63 review_scores_checkin 8982 non-null float64 64 review_scores_communication 8984 non-null float64 65 review_scores_location 8982 non-null float64 66 review_scores_value 8981 non-null float64 67 license 0 non-null float64 68 instant_bookable 11429 non-null object 69 calculated_host_listings_count 11429 non-null int64 70 calculated_host_listings_count_entire_homes 11429 non-null int64 71 calculated_host_listings_count_private_rooms 11429 non-null int64 72 calculated_host_listings_count_shared_rooms 11429 non-null int64 73 reviews_per_month 9091 non-null float64 dtypes: float64(20), int64(21), object(33) memory usage: 6.5+ MB
We select the variables relevant for our goal
to_keep = ['id',
'description',
'accommodates',
'bathrooms',
'bedrooms',
'beds'
]
listings_det = listings_det.loc[:,to_keep]
listings_det
| id | description | accommodates | bathrooms | bedrooms | beds | |
|---|---|---|---|---|---|---|
| 0 | 15883 | Four rooms, each one differently and individua... | 3 | NaN | 1.0 | 2.0 |
| 1 | 38768 | 39m² apartment with beautiful courtyard of the... | 5 | NaN | 1.0 | 3.0 |
| 2 | 40625 | Welcome to my Apt. 1!<br /><br />This is a 2be... | 6 | NaN | 2.0 | 4.0 |
| 3 | 51287 | small studio in new renovated old house and ve... | 3 | NaN | NaN | 2.0 |
| 4 | 70637 | <b>The space</b><br />My apartment (including ... | 2 | NaN | 1.0 | 2.0 |
| ... | ... | ... | ... | ... | ... | ... |
| 11424 | 53711307 | ONLY BUSINESS TRAVELLERS ARE ALLOWED UNTIL DEC... | 2 | NaN | 1.0 | 1.0 |
| 11425 | 53713303 | The Apartment has a spacious living room with ... | 4 | NaN | 1.0 | 2.0 |
| 11426 | 53715433 | Eine hübsches Zimmer mit Gemeinschaftsbad in d... | 2 | NaN | 1.0 | 1.0 |
| 11427 | 53716202 | Eine hübsches Zimmer mit Gemeinschaftsbad in d... | 2 | NaN | 1.0 | 1.0 |
| 11428 | 53716591 | Eine hübsches Zimmer mit Gemeinschaftsbad in d... | 2 | NaN | 1.0 | 1.0 |
11429 rows × 6 columns
Variable type analisys
listings_det.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 11429 entries, 0 to 11428 Data columns (total 6 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 11429 non-null int64 1 description 11097 non-null object 2 accommodates 11429 non-null int64 3 bathrooms 0 non-null float64 4 bedrooms 10084 non-null float64 5 beds 10976 non-null float64 dtypes: float64(3), int64(2), object(1) memory usage: 535.9+ KB
listings_det.isna().sum()
id 0 description 332 accommodates 0 bathrooms 11429 bedrooms 1345 beds 453 dtype: int64
Conclusions:
First, we estimate the number of beds and afterwards the number of bedrooms.
Try to make an assignment of beds number based on the number of people that can be accommodated.
pd.crosstab(listings_det.beds, listings_det.accommodates)
| accommodates | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| beds | ||||||||||||||||
| 1.0 | 570 | 3756 | 445 | 595 | 28 | 20 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2.0 | 36 | 674 | 779 | 1688 | 100 | 140 | 3 | 11 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3.0 | 5 | 38 | 131 | 287 | 236 | 406 | 20 | 29 | 1 | 7 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4.0 | 3 | 8 | 14 | 188 | 66 | 130 | 41 | 63 | 7 | 4 | 0 | 0 | 0 | 1 | 0 | 0 |
| 5.0 | 0 | 0 | 2 | 10 | 53 | 25 | 18 | 33 | 10 | 34 | 1 | 4 | 0 | 2 | 0 | 0 |
| 6.0 | 0 | 0 | 1 | 5 | 3 | 71 | 9 | 14 | 4 | 14 | 3 | 5 | 1 | 0 | 0 | 1 |
| 7.0 | 0 | 0 | 0 | 0 | 2 | 1 | 16 | 8 | 5 | 6 | 2 | 1 | 0 | 3 | 0 | 1 |
| 8.0 | 0 | 2 | 0 | 0 | 0 | 0 | 2 | 14 | 0 | 4 | 1 | 7 | 0 | 2 | 1 | 5 |
| 9.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 2 | 0 | 1 |
| 10.0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 7 | 0 | 0 | 0 | 2 | 0 | 0 |
| 11.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 12.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 1 |
| 13.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 14.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 15.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 16.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 |
| 17.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 18.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| 28.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
Observing the abova table, we perform the following approcah:
Null values count and frequency of each value
listings_det['beds'].value_counts(dropna = False)
1.0 5418 2.0 3432 3.0 1160 4.0 525 NaN 453 5.0 192 6.0 131 7.0 45 8.0 38 10.0 12 9.0 6 16.0 4 12.0 3 11.0 2 13.0 2 18.0 2 17.0 1 15.0 1 14.0 1 28.0 1 Name: beds, dtype: int64
We create a function to set the nulls of beds depending on accommodates.
def estimate_nulls_beds(record):
conditions = [(record.accommodates <= 2),
(record.accommodates > 2) & (record.accommodates <= 4),
(record.accommodates > 4) & (record.accommodates <= 6),
(record.accommodates > 6) & (record.accommodates <= 8),
(record.accommodates > 8)]
results = [1,2,3,4,5]
return(np.select(conditions,results, default = -999))
listings_det.loc[listings_det.beds.isna(),'beds'] = listings_det.loc[listings_det.beds.isna()].apply(estimate_nulls_beds, axis = 1).astype('float64')
Check
listings_det.beds.value_counts(dropna = False)
1.0 5718 2.0 3545 3.0 1189 4.0 532 5.0 196 6.0 131 7.0 45 8.0 38 10.0 12 9.0 6 16.0 4 12.0 3 11.0 2 13.0 2 18.0 2 17.0 1 15.0 1 14.0 1 28.0 1 Name: beds, dtype: int64
We estimate the number of bedrooms. A priori, we could estimate it with either the variable accomodates or beds.
Since the appartament owner may decide to place 10 people in a single room, we consider variable beds to be more reliable
pd.crosstab(listings_det.bedrooms, listings_det.beds, dropna=False)
| beds | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 | 6.0 | 7.0 | 8.0 | 9.0 | 10.0 | 11.0 | 12.0 | 13.0 | 14.0 | 15.0 | 16.0 | 17.0 | 18.0 | 28.0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| bedrooms | |||||||||||||||||||
| 1.0 | 4723 | 2264 | 382 | 145 | 34 | 13 | 2 | 5 | 0 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2.0 | 65 | 883 | 566 | 247 | 66 | 52 | 9 | 9 | 2 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3.0 | 9 | 15 | 193 | 91 | 58 | 46 | 22 | 5 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
| 4.0 | 0 | 0 | 1 | 21 | 26 | 14 | 4 | 11 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 5.0 | 4 | 1 | 0 | 0 | 6 | 4 | 4 | 3 | 0 | 2 | 0 | 0 | 1 | 0 | 0 | 2 | 0 | 0 | 0 |
| 6.0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7.0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 8.0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 10.0 | 20 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 12.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 14.0 | 14 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 15.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 18.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 19.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
We perform the following assigment:
listings_det.bedrooms.value_counts(dropna = False)
1.0 7574 2.0 1901 NaN 1345 3.0 445 4.0 81 5.0 27 10.0 22 14.0 14 6.0 8 8.0 4 7.0 3 18.0 1 9.0 1 12.0 1 19.0 1 15.0 1 Name: bedrooms, dtype: int64
We create and execute the function to assign values to nulls:
def estimate_nulls_bedrooms(record):
conditions = [(record.beds <= 2), # on or two ---> 1
(record.beds > 2) & (record.beds <= 4), # three or four --> 2
(record.beds > 4) & (record.beds <= 6), # five or six --> 2,3
(record.beds > 6) & (record.beds <= 7), # seven --> 3
(record.beds > 7)] # more than seven --> 4
results = [1,2,np.random.choice([2,3] ,p=[0.5, 0.5]),3,4]
return(np.select(conditions,results, default = -999))
return(np.select(condiciones,resultados, default = -999))
listings_det.loc[listings_det.bedrooms.isna(),'bedrooms'] = listings_det.loc[listings_det.bedrooms.isna()].apply(estimate_nulls_bedrooms, axis = 1).astype('float64')
Check
listings_det.bedrooms.value_counts(dropna = False)
1.0 8837 2.0 1978 3.0 450 4.0 81 5.0 27 10.0 22 14.0 14 6.0 8 8.0 4 7.0 3 18.0 1 9.0 1 12.0 1 19.0 1 15.0 1 Name: bedrooms, dtype: int64
We delete the bathrooms variable.
listings_det.drop(columns = 'bathrooms', inplace = True)
listings_det
| id | description | accommodates | bedrooms | beds | |
|---|---|---|---|---|---|
| 0 | 15883 | Four rooms, each one differently and individua... | 3 | 1.0 | 2.0 |
| 1 | 38768 | 39m² apartment with beautiful courtyard of the... | 5 | 1.0 | 3.0 |
| 2 | 40625 | Welcome to my Apt. 1!<br /><br />This is a 2be... | 6 | 2.0 | 4.0 |
| 3 | 51287 | small studio in new renovated old house and ve... | 3 | 1.0 | 2.0 |
| 4 | 70637 | <b>The space</b><br />My apartment (including ... | 2 | 1.0 | 2.0 |
| ... | ... | ... | ... | ... | ... |
| 11424 | 53711307 | ONLY BUSINESS TRAVELLERS ARE ALLOWED UNTIL DEC... | 2 | 1.0 | 1.0 |
| 11425 | 53713303 | The Apartment has a spacious living room with ... | 4 | 1.0 | 2.0 |
| 11426 | 53715433 | Eine hübsches Zimmer mit Gemeinschaftsbad in d... | 2 | 1.0 | 1.0 |
| 11427 | 53716202 | Eine hübsches Zimmer mit Gemeinschaftsbad in d... | 2 | 1.0 | 1.0 |
| 11428 | 53716591 | Eine hübsches Zimmer mit Gemeinschaftsbad in d... | 2 | 1.0 | 1.0 |
11429 rows × 5 columns
We check if there are any duplicate records
listings_det.duplicated().sum()
0
listings_det.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 11429 entries, 0 to 11428 Data columns (total 5 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 11429 non-null int64 1 description 11097 non-null object 2 accommodates 11429 non-null int64 3 bedrooms 11429 non-null float64 4 beds 11429 non-null float64 dtypes: float64(2), int64(2), object(1) memory usage: 446.6+ KB
listings_det.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 11429 entries, 0 to 11428 Data columns (total 5 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 11429 non-null int64 1 description 11097 non-null object 2 accommodates 11429 non-null int64 3 bedrooms 11429 non-null float64 4 beds 11429 non-null float64 dtypes: float64(2), int64(2), object(1) memory usage: 446.6+ KB
listings_det.describe(include = 'number').T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| id | 11429.0 | 2.947266e+07 | 1.594658e+07 | 15883.0 | 16341520.0 | 30545339.0 | 43257279.0 | 53716591.0 |
| accommodates | 11429.0 | 3.263103e+00 | 1.818695e+00 | 0.0 | 2.0 | 3.0 | 4.0 | 16.0 |
| bedrooms | 11429.0 | 1.329250e+00 | 9.043333e-01 | 1.0 | 1.0 | 1.0 | 1.0 | 19.0 |
| beds | 11429.0 | 1.865430e+00 | 1.304828e+00 | 1.0 | 1.0 | 1.0 | 2.0 | 28.0 |
No unusual findings.
We have two tables
Both share the id field, so we can merge them.
Aditionally, we have the price_m2 table that we should merge according to the district variable.
listings.shape
(10959, 10)
listings_det.shape
(11429, 5)
df = pd.merge(left = listings, right = listings_det, how = 'left', on = 'id')
df
| id | name | host_id | neighbourhood | latitude | longitude | room_type | price | minimum_nights | availability_365 | description | accommodates | bedrooms | beds | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 38768 | central cityapartement- wifi- nice neighbourhood | 166283 | Leopoldstadt | 48.219240 | 16.378310 | Entire home/apt | 66 | 3 | 124 | 39m² apartment with beautiful courtyard of the... | 5 | 1.0 | 3.0 |
| 1 | 40625 | Near Palace Schönbrunn, Apt. 1 | 175131 | Rudolfsheim-Fnfhaus | 48.184340 | 16.327010 | Entire home/apt | 156 | 1 | 306 | Welcome to my Apt. 1!<br /><br />This is a 2be... | 6 | 2.0 | 4.0 |
| 2 | 51287 | little studio- next to citycenter- wifi- nice ... | 166283 | Leopoldstadt | 48.217780 | 16.378470 | Entire home/apt | 62 | 3 | 136 | small studio in new renovated old house and ve... | 3 | 1.0 | 2.0 |
| 3 | 70637 | Flat in the Center with Terrace | 358842 | Leopoldstadt | 48.217600 | 16.380180 | Private room | 50 | 2 | 333 | <b>The space</b><br />My apartment (including ... | 2 | 1.0 | 2.0 |
| 4 | 75471 | nice big apartment with balcony | 363315 | Ottakring | 48.222070 | 16.315940 | Entire home/apt | 77 | 3 | 0 | you will like my beautiful apartment with balc... | 4 | 2.0 | 2.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10954 | 53711307 | Designer Apartment | 8 mins to U3 Ottakring | 64601861 | Ottakring | 48.208908 | 16.321311 | Entire home/apt | 69 | 4 | 139 | ONLY BUSINESS TRAVELLERS ARE ALLOWED UNTIL DEC... | 2 | 1.0 | 1.0 |
| 10955 | 53713303 | XII-JNBP New Apartment top 12 | 120675973 | Ottakring | 48.213201 | 16.324996 | Entire home/apt | 69 | 2 | 171 | The Apartment has a spacious living room with ... | 4 | 1.0 | 2.0 |
| 10956 | 53715433 | Private 1BR with shared bathroom | 404652017 | Ottakring | 48.206001 | 16.328201 | Private room | 53 | 1 | 13 | Eine hübsches Zimmer mit Gemeinschaftsbad in d... | 2 | 1.0 | 1.0 |
| 10957 | 53716202 | adorable 1BR with shared bathroom | 404652017 | Ottakring | 48.206153 | 16.328158 | Private room | 54 | 1 | 13 | Eine hübsches Zimmer mit Gemeinschaftsbad in d... | 2 | 1.0 | 1.0 |
| 10958 | 53716591 | Schöne Suite near Wiener Stadthalle | 404652017 | Ottakring | 48.205422 | 16.327949 | Private room | 55 | 1 | 13 | Eine hübsches Zimmer mit Gemeinschaftsbad in d... | 2 | 1.0 | 1.0 |
10959 rows × 14 columns
Now let's see how we can incorporate the external information of the price per square meter.
To do this the first step is to analyze the values of the variable district in both tables, since they need to match so that we can cross them.
district1 = pd.Series(df.neighbourhood.unique().categories).sort_values()
district1
0 Alsergrund 1 Brigittenau 2 Donaustadt 3 Dbling 4 Favoriten 5 Floridsdorf 6 Hernals 7 Hietzing 8 Innere Stadt 9 Josefstadt 10 Landstra§e 11 Leopoldstadt 12 Liesing 13 Margareten 14 Mariahilf 15 Meidling 16 Neubau 17 Ottakring 18 Penzing 19 Rudolfsheim-Fnfhaus 20 Simmering 21 Wieden 22 Whring dtype: object
district2 = price_m2.district
district2
1.0 Innere Stadt 2.0 Leopoldstadt 3.0 Landstraße 4.0 Wieden 5.0 Margareten 6.0 Mariahilf 7.0 Neubau 8.0 Josefstadt 9.0 Alsergrund 10.0 Favoriten 11.0 Simmering 12.0 Meidling 13.0 Hietzing 14.0 Penzing 15.0 Rudolfsheim-Fünfhaus 16.0 Ottakring 17.0 Hernals 18.0 Währing 19.0 Döbling 20.0 Brigittenau 21.0 Floridsdorf 22.0 Donaustadt 23.0 Liesing Name: district, dtype: object
There are some differences in the german characters.
def map_names(record):
if 'Rudolfsheim' in record:
record= 'Rudolfsheim-Fünfhaus'
elif 'bling'in record:
record= 'Döbling'
elif 'hring'in record:
record= 'Währing'
elif 'Land'in record:
record= 'Landstraße'
else:
record=record
return record
df.neighbourhood=df.neighbourhood.map(map_names)
df
| id | name | host_id | neighbourhood | latitude | longitude | room_type | price | minimum_nights | availability_365 | description | accommodates | bedrooms | beds | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 38768 | central cityapartement- wifi- nice neighbourhood | 166283 | Leopoldstadt | 48.219240 | 16.378310 | Entire home/apt | 66 | 3 | 124 | 39m² apartment with beautiful courtyard of the... | 5 | 1.0 | 3.0 |
| 1 | 40625 | Near Palace Schönbrunn, Apt. 1 | 175131 | Rudolfsheim-Fünfhaus | 48.184340 | 16.327010 | Entire home/apt | 156 | 1 | 306 | Welcome to my Apt. 1!<br /><br />This is a 2be... | 6 | 2.0 | 4.0 |
| 2 | 51287 | little studio- next to citycenter- wifi- nice ... | 166283 | Leopoldstadt | 48.217780 | 16.378470 | Entire home/apt | 62 | 3 | 136 | small studio in new renovated old house and ve... | 3 | 1.0 | 2.0 |
| 3 | 70637 | Flat in the Center with Terrace | 358842 | Leopoldstadt | 48.217600 | 16.380180 | Private room | 50 | 2 | 333 | <b>The space</b><br />My apartment (including ... | 2 | 1.0 | 2.0 |
| 4 | 75471 | nice big apartment with balcony | 363315 | Ottakring | 48.222070 | 16.315940 | Entire home/apt | 77 | 3 | 0 | you will like my beautiful apartment with balc... | 4 | 2.0 | 2.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10954 | 53711307 | Designer Apartment | 8 mins to U3 Ottakring | 64601861 | Ottakring | 48.208908 | 16.321311 | Entire home/apt | 69 | 4 | 139 | ONLY BUSINESS TRAVELLERS ARE ALLOWED UNTIL DEC... | 2 | 1.0 | 1.0 |
| 10955 | 53713303 | XII-JNBP New Apartment top 12 | 120675973 | Ottakring | 48.213201 | 16.324996 | Entire home/apt | 69 | 2 | 171 | The Apartment has a spacious living room with ... | 4 | 1.0 | 2.0 |
| 10956 | 53715433 | Private 1BR with shared bathroom | 404652017 | Ottakring | 48.206001 | 16.328201 | Private room | 53 | 1 | 13 | Eine hübsches Zimmer mit Gemeinschaftsbad in d... | 2 | 1.0 | 1.0 |
| 10957 | 53716202 | adorable 1BR with shared bathroom | 404652017 | Ottakring | 48.206153 | 16.328158 | Private room | 54 | 1 | 13 | Eine hübsches Zimmer mit Gemeinschaftsbad in d... | 2 | 1.0 | 1.0 |
| 10958 | 53716591 | Schöne Suite near Wiener Stadthalle | 404652017 | Ottakring | 48.205422 | 16.327949 | Private room | 55 | 1 | 13 | Eine hübsches Zimmer mit Gemeinschaftsbad in d... | 2 | 1.0 | 1.0 |
10959 rows × 14 columns
df.neighbourhood.value_counts(dropna= 'False') # df[df.neighbourhood.isna()]
Leopoldstadt 1186 Landstraße 985 Rudolfsheim-Fünfhaus 822 Alsergrund 664 Favoriten 653 Neubau 648 Innere Stadt 606 Margareten 591 Ottakring 557 Mariahilf 517 Wieden 451 Meidling 448 Brigittenau 410 Josefstadt 403 Donaustadt 372 Währing 315 Hernals 304 Penzing 278 Döbling 231 Floridsdorf 178 Hietzing 136 Simmering 132 Liesing 72 Name: neighbourhood, dtype: int64
price_m2.district.unique()
array([' Innere Stadt', ' Leopoldstadt', ' Landstraße', ' Wieden',
' Margareten', ' Mariahilf', ' Neubau', ' Josefstadt',
' Alsergrund', ' Favoriten', ' Simmering', ' Meidling',
' Hietzing', ' Penzing', ' Rudolfsheim-Fünfhaus', ' Ottakring',
' Hernals', ' Währing', ' Döbling', ' Brigittenau',
' Floridsdorf', ' Donaustadt', ' Liesing'], dtype=object)
We delete the spaces.
price_m2.district=price_m2.district.str.lstrip()
price_m2.district.unique()
array(['Innere Stadt', 'Leopoldstadt', 'Landstraße', 'Wieden',
'Margareten', 'Mariahilf', 'Neubau', 'Josefstadt', 'Alsergrund',
'Favoriten', 'Simmering', 'Meidling', 'Hietzing', 'Penzing',
'Rudolfsheim-Fünfhaus', 'Ottakring', 'Hernals', 'Währing',
'Döbling', 'Brigittenau', 'Floridsdorf', 'Donaustadt', 'Liesing'],
dtype=object)
Now we can integrate it to the main table.
df = pd.merge(left = df, right = price_m2, how = 'left', left_on='neighbourhood', right_on='district')
df
| id | name | host_id | neighbourhood | latitude | longitude | room_type | price | minimum_nights | availability_365 | description | accommodates | bedrooms | beds | district | price_0_50 | price_51_80 | price_81_129 | price_130 | price_avg_m2 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 38768 | central cityapartement- wifi- nice neighbourhood | 166283 | Leopoldstadt | 48.219240 | 16.378310 | Entire home/apt | 66 | 3 | 124 | 39m² apartment with beautiful courtyard of the... | 5 | 1.0 | 3.0 | Leopoldstadt | 6747 | 7204 | 7898 | 9405 | 7460 |
| 1 | 40625 | Near Palace Schönbrunn, Apt. 1 | 175131 | Rudolfsheim-Fünfhaus | 48.184340 | 16.327010 | Entire home/apt | 156 | 1 | 306 | Welcome to my Apt. 1!<br /><br />This is a 2be... | 6 | 2.0 | 4.0 | Rudolfsheim-Fünfhaus | 5902 | 6745 | 6220 | 8132 | 6509 |
| 2 | 51287 | little studio- next to citycenter- wifi- nice ... | 166283 | Leopoldstadt | 48.217780 | 16.378470 | Entire home/apt | 62 | 3 | 136 | small studio in new renovated old house and ve... | 3 | 1.0 | 2.0 | Leopoldstadt | 6747 | 7204 | 7898 | 9405 | 7460 |
| 3 | 70637 | Flat in the Center with Terrace | 358842 | Leopoldstadt | 48.217600 | 16.380180 | Private room | 50 | 2 | 333 | <b>The space</b><br />My apartment (including ... | 2 | 1.0 | 2.0 | Leopoldstadt | 6747 | 7204 | 7898 | 9405 | 7460 |
| 4 | 75471 | nice big apartment with balcony | 363315 | Ottakring | 48.222070 | 16.315940 | Entire home/apt | 77 | 3 | 0 | you will like my beautiful apartment with balc... | 4 | 2.0 | 2.0 | Ottakring | 5590 | 5961 | 5418 | 5778 | 5587 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10954 | 53711307 | Designer Apartment | 8 mins to U3 Ottakring | 64601861 | Ottakring | 48.208908 | 16.321311 | Entire home/apt | 69 | 4 | 139 | ONLY BUSINESS TRAVELLERS ARE ALLOWED UNTIL DEC... | 2 | 1.0 | 1.0 | Ottakring | 5590 | 5961 | 5418 | 5778 | 5587 |
| 10955 | 53713303 | XII-JNBP New Apartment top 12 | 120675973 | Ottakring | 48.213201 | 16.324996 | Entire home/apt | 69 | 2 | 171 | The Apartment has a spacious living room with ... | 4 | 1.0 | 2.0 | Ottakring | 5590 | 5961 | 5418 | 5778 | 5587 |
| 10956 | 53715433 | Private 1BR with shared bathroom | 404652017 | Ottakring | 48.206001 | 16.328201 | Private room | 53 | 1 | 13 | Eine hübsches Zimmer mit Gemeinschaftsbad in d... | 2 | 1.0 | 1.0 | Ottakring | 5590 | 5961 | 5418 | 5778 | 5587 |
| 10957 | 53716202 | adorable 1BR with shared bathroom | 404652017 | Ottakring | 48.206153 | 16.328158 | Private room | 54 | 1 | 13 | Eine hübsches Zimmer mit Gemeinschaftsbad in d... | 2 | 1.0 | 1.0 | Ottakring | 5590 | 5961 | 5418 | 5778 | 5587 |
| 10958 | 53716591 | Schöne Suite near Wiener Stadthalle | 404652017 | Ottakring | 48.205422 | 16.327949 | Private room | 55 | 1 | 13 | Eine hübsches Zimmer mit Gemeinschaftsbad in d... | 2 | 1.0 | 1.0 | Ottakring | 5590 | 5961 | 5418 | 5778 | 5587 |
10959 rows × 20 columns
Check is some nulls were generated.
df.price_avg_m2.isna().sum()
#df[df.price_avg_m2.isna()]
0
pd.options.display.max_columns = None
We have identified the necessary KPIs for the analysyi. We check if we have all of them in the proper format.
Price
The documentation does not clarify whether the price is for the entire property, or if a room is rented on a per room basis.
This is a key data to be able to make the valuation of the potential income of a property.
Let's try to understand it by analyzing the average price by type of rental.
It is important to filter by only one district in order not to include the "area" effect.
So first we choose a district that has a lot of data.
df.district.value_counts()
Leopoldstadt 1186 Landstraße 985 Rudolfsheim-Fünfhaus 822 Alsergrund 664 Favoriten 653 Neubau 648 Innere Stadt 606 Margareten 591 Ottakring 557 Mariahilf 517 Wieden 451 Meidling 448 Brigittenau 410 Josefstadt 403 Donaustadt 372 Währing 315 Hernals 304 Penzing 278 Döbling 231 Floridsdorf 178 Hietzing 136 Simmering 132 Liesing 72 Name: district, dtype: int64
df.loc[df.district == 'Leopoldstadt',:].groupby('room_type').price.mean()
room_type Entire home/apt 85.680258 Hotel room NaN Private room 45.223140 Shared room 44.833333 Name: price, dtype: float64
Conclusion:
However, multiplying the price by the total number of rooms may result into an overestimation of occupancy, since all rooms of an appartaments will not be reted at the same time.
Therefore we should weight it by the average percentage of rooms rented. We do not have that data, but let's assume that we have talked to the business manager and he has told us that it is 70%.
We can create the total price variable by applying apply on a custom function.
def create_total_price(record):
if (record.beds > 1) & ((record.room_type == 'Private room') | (record.room_type == 'Shared room')):
result = record.price * record.beds * 0.7
else:
result = record.price
return(result)
df['rental_price'] = df.apply(create_total_price, axis = 1)
Check
df[['room_type','price','beds','rental_price']].head(10)
| room_type | price | beds | rental_price | |
|---|---|---|---|---|
| 0 | Entire home/apt | 66 | 3.0 | 66.0 |
| 1 | Entire home/apt | 156 | 4.0 | 156.0 |
| 2 | Entire home/apt | 62 | 2.0 | 62.0 |
| 3 | Private room | 50 | 2.0 | 70.0 |
| 4 | Entire home/apt | 77 | 2.0 | 77.0 |
| 5 | Entire home/apt | 65 | 2.0 | 65.0 |
| 6 | Entire home/apt | 98 | 2.0 | 98.0 |
| 7 | Entire home/apt | 101 | 4.0 | 101.0 |
| 8 | Entire home/apt | 52 | 1.0 | 52.0 |
| 9 | Entire home/apt | 82 | 4.0 | 82.0 |
Occupancy
The variable we have that allows us to measure this is availability_365.
This variable tells us the number of days a year ahead that the property is NOT occupied. Therefore we would be interested in transforming it to a more direct measure of occupancy, for example the % of the year that it IS occupied.
df['occupancy'] = ((365 - df.availability_365) / 365 * 100).astype('int')
df.head()
| id | name | host_id | neighbourhood | latitude | longitude | room_type | price | minimum_nights | availability_365 | description | accommodates | bedrooms | beds | district | price_0_50 | price_51_80 | price_81_129 | price_130 | price_avg_m2 | rental_price | occupancy | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 38768 | central cityapartement- wifi- nice neighbourhood | 166283 | Leopoldstadt | 48.21924 | 16.37831 | Entire home/apt | 66 | 3 | 124 | 39m² apartment with beautiful courtyard of the... | 5 | 1.0 | 3.0 | Leopoldstadt | 6747 | 7204 | 7898 | 9405 | 7460 | 66.0 | 66 |
| 1 | 40625 | Near Palace Schönbrunn, Apt. 1 | 175131 | Rudolfsheim-Fünfhaus | 48.18434 | 16.32701 | Entire home/apt | 156 | 1 | 306 | Welcome to my Apt. 1!<br /><br />This is a 2be... | 6 | 2.0 | 4.0 | Rudolfsheim-Fünfhaus | 5902 | 6745 | 6220 | 8132 | 6509 | 156.0 | 16 |
| 2 | 51287 | little studio- next to citycenter- wifi- nice ... | 166283 | Leopoldstadt | 48.21778 | 16.37847 | Entire home/apt | 62 | 3 | 136 | small studio in new renovated old house and ve... | 3 | 1.0 | 2.0 | Leopoldstadt | 6747 | 7204 | 7898 | 9405 | 7460 | 62.0 | 62 |
| 3 | 70637 | Flat in the Center with Terrace | 358842 | Leopoldstadt | 48.21760 | 16.38018 | Private room | 50 | 2 | 333 | <b>The space</b><br />My apartment (including ... | 2 | 1.0 | 2.0 | Leopoldstadt | 6747 | 7204 | 7898 | 9405 | 7460 | 70.0 | 8 |
| 4 | 75471 | nice big apartment with balcony | 363315 | Ottakring | 48.22207 | 16.31594 | Entire home/apt | 77 | 3 | 0 | you will like my beautiful apartment with balc... | 4 | 2.0 | 2.0 | Ottakring | 5590 | 5961 | 5418 | 5778 | 5587 | 77.0 | 100 |
We are going to discretize and analyze some of the variables: accommodates, bedrooms and beds.
Discretize bedrooms
First we evaluate the distribution of the bedrooms.
df.bedrooms.value_counts().plot.bar();
Vamos a discretizar para 1,2,3 y más de 3. We discretize one, two, three or more than three bedrooms.
conditions = [df.bedrooms == 1,
df.bedrooms == 2,
df.bedrooms == 3,
df.bedrooms > 3]
results = ['01_One','02_Two','03_Three','04_Four or more']
df['bedrooms_disc'] = np.select(conditions, results, default = -999)
Check
df.bedrooms_disc.value_counts().plot.bar();
Discretizae accommodates and beds
We discretize the values according the percentiles.
df['accommodates_disc'] = pd.qcut(df.accommodates,[0, 0.5, 0.8, 1],
labels = ['0-2','3','4-16'])
df['accommodates_disc'].value_counts().sort_index(ascending = False).plot.barh();
Discretize beds
conditions = [df.beds == 1,
df.beds == 2,
df.beds == 3,
df.beds == 4,
df.beds >= 5]
results = ['1','2','3','4','5-28']
df['beds_disc'] = np.select(conditions, results, default = -999)
df.bedrooms_disc.value_counts().plot.barh();
We know the purchase price per squared meter depending on the number of squred meters and district of the poroperty.
We also know the rental price according to the number of beds.
In order to find the square meters of the rental properties, first we estimate the number of squared meters with the number of rooms. After that, we compute the purchase price of the property, multiplying the squared meetrs with its price per squared meter.
Estimation of the square meters of the property
We assume the squared meters according to the number of rooms:
n_rooms = [df.bedrooms == 1,
df.bedrooms == 2,
df.bedrooms == 3,
df.bedrooms == 4,
df.bedrooms > 4]
est_m2 = [50,70,100,120,140]
df['m2'] = np.select(n_rooms, est_m2, default = -999)
Check
df['m2'].value_counts()
50 8384 70 1962 100 449 140 83 120 81 Name: m2, dtype: int64
Now we can estimate the purchase price of the property.
We recall that we took 25% off the price that we get for negotiation capacity.
df
| id | name | host_id | neighbourhood | latitude | longitude | room_type | price | minimum_nights | availability_365 | description | accommodates | bedrooms | beds | district | price_0_50 | price_51_80 | price_81_129 | price_130 | price_avg_m2 | rental_price | occupancy | bedrooms_disc | accommodates_disc | beds_disc | m2 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 38768 | central cityapartement- wifi- nice neighbourhood | 166283 | Leopoldstadt | 48.219240 | 16.378310 | Entire home/apt | 66 | 3 | 124 | 39m² apartment with beautiful courtyard of the... | 5 | 1.0 | 3.0 | Leopoldstadt | 6747 | 7204 | 7898 | 9405 | 7460 | 66.0 | 66 | 01_One | 4-16 | 3 | 50 |
| 1 | 40625 | Near Palace Schönbrunn, Apt. 1 | 175131 | Rudolfsheim-Fünfhaus | 48.184340 | 16.327010 | Entire home/apt | 156 | 1 | 306 | Welcome to my Apt. 1!<br /><br />This is a 2be... | 6 | 2.0 | 4.0 | Rudolfsheim-Fünfhaus | 5902 | 6745 | 6220 | 8132 | 6509 | 156.0 | 16 | 02_Two | 4-16 | 4 | 70 |
| 2 | 51287 | little studio- next to citycenter- wifi- nice ... | 166283 | Leopoldstadt | 48.217780 | 16.378470 | Entire home/apt | 62 | 3 | 136 | small studio in new renovated old house and ve... | 3 | 1.0 | 2.0 | Leopoldstadt | 6747 | 7204 | 7898 | 9405 | 7460 | 62.0 | 62 | 01_One | 0-2 | 2 | 50 |
| 3 | 70637 | Flat in the Center with Terrace | 358842 | Leopoldstadt | 48.217600 | 16.380180 | Private room | 50 | 2 | 333 | <b>The space</b><br />My apartment (including ... | 2 | 1.0 | 2.0 | Leopoldstadt | 6747 | 7204 | 7898 | 9405 | 7460 | 70.0 | 8 | 01_One | 0-2 | 2 | 50 |
| 4 | 75471 | nice big apartment with balcony | 363315 | Ottakring | 48.222070 | 16.315940 | Entire home/apt | 77 | 3 | 0 | you will like my beautiful apartment with balc... | 4 | 2.0 | 2.0 | Ottakring | 5590 | 5961 | 5418 | 5778 | 5587 | 77.0 | 100 | 02_Two | 3 | 2 | 70 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10954 | 53711307 | Designer Apartment | 8 mins to U3 Ottakring | 64601861 | Ottakring | 48.208908 | 16.321311 | Entire home/apt | 69 | 4 | 139 | ONLY BUSINESS TRAVELLERS ARE ALLOWED UNTIL DEC... | 2 | 1.0 | 1.0 | Ottakring | 5590 | 5961 | 5418 | 5778 | 5587 | 69.0 | 61 | 01_One | 0-2 | 1 | 50 |
| 10955 | 53713303 | XII-JNBP New Apartment top 12 | 120675973 | Ottakring | 48.213201 | 16.324996 | Entire home/apt | 69 | 2 | 171 | The Apartment has a spacious living room with ... | 4 | 1.0 | 2.0 | Ottakring | 5590 | 5961 | 5418 | 5778 | 5587 | 69.0 | 53 | 01_One | 3 | 2 | 50 |
| 10956 | 53715433 | Private 1BR with shared bathroom | 404652017 | Ottakring | 48.206001 | 16.328201 | Private room | 53 | 1 | 13 | Eine hübsches Zimmer mit Gemeinschaftsbad in d... | 2 | 1.0 | 1.0 | Ottakring | 5590 | 5961 | 5418 | 5778 | 5587 | 53.0 | 96 | 01_One | 0-2 | 1 | 50 |
| 10957 | 53716202 | adorable 1BR with shared bathroom | 404652017 | Ottakring | 48.206153 | 16.328158 | Private room | 54 | 1 | 13 | Eine hübsches Zimmer mit Gemeinschaftsbad in d... | 2 | 1.0 | 1.0 | Ottakring | 5590 | 5961 | 5418 | 5778 | 5587 | 54.0 | 96 | 01_One | 0-2 | 1 | 50 |
| 10958 | 53716591 | Schöne Suite near Wiener Stadthalle | 404652017 | Ottakring | 48.205422 | 16.327949 | Private room | 55 | 1 | 13 | Eine hübsches Zimmer mit Gemeinschaftsbad in d... | 2 | 1.0 | 1.0 | Ottakring | 5590 | 5961 | 5418 | 5778 | 5587 | 55.0 | 96 | 01_One | 0-2 | 1 | 50 |
10959 rows × 26 columns
def set_price(record):
if record.bedrooms == 1:
result= record.price_0_50*record.m2
elif record.bedrooms == 2:
result= record.price_51_80*record.m2
elif record.bedrooms == 3:
result= record.price_81_129*record.m2
elif record.bedrooms == 4:
result= record.price_130*record.m2
elif record.bedrooms >= 5:
result= 180*record.price_avg_m2*record.m2
else:
result= -999
return result*0.75
df['purchase_price'] = df.apply(set_price, axis = 1)
Check
df[['bedrooms','m2','district','purchase_price']].head(20)
| bedrooms | m2 | district | purchase_price | |
|---|---|---|---|---|
| 0 | 1.0 | 50 | Leopoldstadt | 253012.5 |
| 1 | 2.0 | 70 | Rudolfsheim-Fünfhaus | 354112.5 |
| 2 | 1.0 | 50 | Leopoldstadt | 253012.5 |
| 3 | 1.0 | 50 | Leopoldstadt | 253012.5 |
| 4 | 2.0 | 70 | Ottakring | 312952.5 |
| 5 | 2.0 | 70 | Brigittenau | 298147.5 |
| 6 | 1.0 | 50 | Neubau | 280237.5 |
| 7 | 2.0 | 70 | Rudolfsheim-Fünfhaus | 354112.5 |
| 8 | 1.0 | 50 | Hernals | 197137.5 |
| 9 | 2.0 | 70 | Rudolfsheim-Fünfhaus | 354112.5 |
| 10 | 1.0 | 50 | Floridsdorf | 263437.5 |
| 11 | 1.0 | 50 | Leopoldstadt | 253012.5 |
| 12 | 1.0 | 50 | Leopoldstadt | 253012.5 |
| 13 | 1.0 | 50 | Leopoldstadt | 253012.5 |
| 14 | 1.0 | 50 | Ottakring | 209625.0 |
| 15 | 1.0 | 50 | Hernals | 197137.5 |
| 16 | 1.0 | 50 | Leopoldstadt | 253012.5 |
| 17 | 1.0 | 50 | Alsergrund | 302475.0 |
| 18 | 1.0 | 50 | Leopoldstadt | 253012.5 |
| 19 | 1.0 | 50 | Simmering | 242625.0 |
pd.options.display.float_format = '{:.2f}'.format # Remove scientific notation
df.rental_price.describe()
count 10959.00 mean 88.29 std 174.26 min 20.00 25% 43.00 50% 62.00 75% 94.50 max 9270.00 Name: rental_price, dtype: float64
There is an outlier in the upper part. We select the median instead of the mean to ignore its effect.
df.rental_price.median()
62.0
df.groupby('district').rental_price.median().sort_values(ascending = False)
district Innere Stadt 125.00 Mariahilf 70.00 Liesing 69.00 Landstraße 68.00 Wieden 67.00 Josefstadt 66.00 Neubau 65.00 Leopoldstadt 63.00 Floridsdorf 62.50 Meidling 60.00 Margareten 60.00 Alsergrund 60.00 Favoriten 60.00 Hietzing 59.50 Brigittenau 59.00 Donaustadt 59.00 Währing 59.00 Döbling 57.00 Rudolfsheim-Fünfhaus 56.00 Hernals 55.00 Penzing 55.00 Simmering 52.60 Ottakring 49.00 Name: rental_price, dtype: float64
All values seem reasonable.
temp = df.groupby('district')[['rental_price','purchase_price']].median()
temp
| rental_price | purchase_price | |
|---|---|---|
| district | ||
| Alsergrund | 60.00 | 302475.00 |
| Brigittenau | 59.00 | 205650.00 |
| Donaustadt | 59.00 | 283500.00 |
| Döbling | 57.00 | 297525.00 |
| Favoriten | 60.00 | 252787.50 |
| Floridsdorf | 62.50 | 263437.50 |
| Hernals | 55.00 | 197137.50 |
| Hietzing | 59.50 | 253462.50 |
| Innere Stadt | 125.00 | 731625.00 |
| Josefstadt | 66.00 | 324187.50 |
| Landstraße | 68.00 | 299325.00 |
| Leopoldstadt | 63.00 | 253012.50 |
| Liesing | 69.00 | 235800.00 |
| Margareten | 60.00 | 238575.00 |
| Mariahilf | 70.00 | 319875.00 |
| Meidling | 60.00 | 211837.50 |
| Neubau | 65.00 | 280237.50 |
| Ottakring | 49.00 | 209625.00 |
| Penzing | 55.00 | 224662.50 |
| Rudolfsheim-Fünfhaus | 56.00 | 221325.00 |
| Simmering | 52.60 | 242625.00 |
| Wieden | 67.00 | 370312.50 |
| Währing | 59.00 | 294975.00 |
plt.figure(figsize = (16,8))
sns.scatterplot(data = temp, x = 'purchase_price', y = 'rental_price')
for cada in range(0,temp.shape[0]):
plt.text(temp.purchase_price[cada], temp.rental_price[cada], temp.index[cada])
The Innere Stadt district has the highest rental and purchase prices. We zoom in the other districts.
temp2=temp.drop(index='Innere Stadt')
plt.figure(figsize = (16,10))
sns.scatterplot(data = temp2, x = 'purchase_price', y = 'rental_price')
for cada in range(0,temp2.shape[0]):
plt.text(temp2.purchase_price[cada], temp2.rental_price[cada], temp2.index[cada])
Conclusion:
With a similar purchase price, there are district with a hiogher rental price. For example, Mariahilfer has a similar purchase price of Josefstadt, but the profitability is higher.
With an elevated budget, could pay of to buy several properties in the culter of districts rather than one dindle property in Innere Stadt.
What factors (other than location) determine the rental price?
metricas = ['rental_price','purchase_price']
dimensiones = ['bedrooms_disc','accommodates_disc', 'beds_disc']
minicubo_precio = df[dimensiones + metricas]
minicubo_precio
| bedrooms_disc | accommodates_disc | beds_disc | rental_price | purchase_price | |
|---|---|---|---|---|---|
| 0 | 01_One | 4-16 | 3 | 66.00 | 253012.50 |
| 1 | 02_Two | 4-16 | 4 | 156.00 | 354112.50 |
| 2 | 01_One | 0-2 | 2 | 62.00 | 253012.50 |
| 3 | 01_One | 0-2 | 2 | 70.00 | 253012.50 |
| 4 | 02_Two | 3 | 2 | 77.00 | 312952.50 |
| ... | ... | ... | ... | ... | ... |
| 10954 | 01_One | 0-2 | 1 | 69.00 | 209625.00 |
| 10955 | 01_One | 3 | 2 | 69.00 | 209625.00 |
| 10956 | 01_One | 0-2 | 1 | 53.00 | 209625.00 |
| 10957 | 01_One | 0-2 | 1 | 54.00 | 209625.00 |
| 10958 | 01_One | 0-2 | 1 | 55.00 | 209625.00 |
10959 rows × 5 columns
minicubo_precio = minicubo_precio.melt(id_vars=['purchase_price','rental_price'])
minicubo_precio
| purchase_price | rental_price | variable | value | |
|---|---|---|---|---|
| 0 | 253012.50 | 66.00 | bedrooms_disc | 01_One |
| 1 | 354112.50 | 156.00 | bedrooms_disc | 02_Two |
| 2 | 253012.50 | 62.00 | bedrooms_disc | 01_One |
| 3 | 253012.50 | 70.00 | bedrooms_disc | 01_One |
| 4 | 312952.50 | 77.00 | bedrooms_disc | 02_Two |
| ... | ... | ... | ... | ... |
| 32872 | 209625.00 | 69.00 | beds_disc | 1 |
| 32873 | 209625.00 | 69.00 | beds_disc | 2 |
| 32874 | 209625.00 | 53.00 | beds_disc | 1 |
| 32875 | 209625.00 | 54.00 | beds_disc | 1 |
| 32876 | 209625.00 | 55.00 | beds_disc | 1 |
32877 rows × 4 columns
minicubo_precio = minicubo_precio.groupby(['variable','value'])[['rental_price','purchase_price']].agg('median')
minicubo_precio
| rental_price | purchase_price | ||
|---|---|---|---|
| variable | value | ||
| accommodates_disc | 0-2 | 50.00 | 280237.50 |
| 3 | 75.00 | 297525.00 | |
| 4-16 | 106.00 | 403935.00 | |
| bedrooms_disc | 01_One | 55.00 | 253012.50 |
| 02_Two | 90.00 | 378210.00 | |
| 03_Three | 128.00 | 577950.00 | |
| 04_Four or more | 186.00 | 107399250.00 | |
| beds_disc | 1 | 50.00 | 280237.50 |
| 2 | 68.60 | 283500.00 | |
| 3 | 90.00 | 350857.50 | |
| 4 | 95.00 | 354112.50 | |
| 5-28 | 139.00 | 497700.00 |
We analize each variable of the minicube.
minicubo_precio.loc['bedrooms_disc']
| rental_price | purchase_price | |
|---|---|---|
| value | ||
| 01_One | 55.00 | 253012.50 |
| 02_Two | 90.00 | 378210.00 |
| 03_Three | 128.00 | 577950.00 |
| 04_Four or more | 186.00 | 107399250.00 |
f, ax = plt.subplots()
ax.plot(minicubo_precio.loc['bedrooms_disc'].rental_price)
ax2 = ax.twinx()
ax2.plot(minicubo_precio.loc['bedrooms_disc'].purchase_price,color = 'green');
Conclusion: one bedroom is not optimal.
minicubo_precio.loc['accommodates_disc']
| rental_price | purchase_price | |
|---|---|---|
| value | ||
| 0-2 | 50.00 | 280237.50 |
| 3 | 75.00 | 297525.00 |
| 4-16 | 106.00 | 403935.00 |
f, ax = plt.subplots()
ax.plot(minicubo_precio.loc['accommodates_disc'].rental_price)
ax2 = ax.twinx()
ax2.plot(minicubo_precio.loc['accommodates_disc'].purchase_price,color = 'green');
Conclusion: the optimal number of accomodates is 3.
minicubo_precio.loc['beds_disc']
| rental_price | purchase_price | |
|---|---|---|
| value | ||
| 1 | 50.00 | 280237.50 |
| 2 | 68.60 | 283500.00 |
| 3 | 90.00 | 350857.50 |
| 4 | 95.00 | 354112.50 |
| 5-28 | 139.00 | 497700.00 |
f, ax = plt.subplots()
ax.plot(minicubo_precio.loc['beds_disc'].rental_price)
ax2 = ax.twinx()
ax2.plot(minicubo_precio.loc['beds_disc'].purchase_price,color = 'green');
Conclusion: the rental price (blue) grows with the numer of beds. The purchase price (green) also grows, but the profitability is smaller when the number of beds is three.
We plot the rental price per property on a map.
import folium
#data = df[df.district == 'Liesing'].copy()
data = df.copy()
data['rental_price_disc'] = pd.qcut(data['rental_price'], q = [0, .25, .5, .75, 1.],
labels=['yellow', 'orange', 'blue', 'red'])
my_map=folium.Map(location=[48.2082, 16.3738] ,zoom_start=11.5)
for piso in range(0,len(data)):
folium.CircleMarker(
location = [data.iloc[piso]['latitude'], data.iloc[piso]['longitude']],
popup = data.iloc[piso]['rental_price'],
fill=True,
color = data.iloc[piso]['rental_price_disc'],
fill_opacity=1,
radius = 2
).add_to(my_map)
my_map